  1. Beam
  2. BEAM-6952

Concatenated compressed files bug with Python SDK

Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: Not applicable
    • Fix Version/s: 2.14.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      The Python apache_beam.io.filesystem module has a bug handling concatenated compressed files.
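
      For readers unfamiliar with the term, "concatenated compressed files" here means several complete compressed streams written back to back in one file. The sketch below uses only the Python standard library, not Beam, and the sample payloads are made up for illustration. It shows that a gzip file built this way is expected to decompress to the concatenation of the original payloads, and that a single zlib decompressor object stops at the end of the first member, leaving the rest in unused_data. It does not reproduce the Beam bug itself.

```python
import gzip
import zlib

# Two independent gzip members, written back to back (illustrative data).
first = gzip.compress(b"alpha\n")
second = gzip.compress(b"beta\n")
combined = first + second

# A multi-member-aware reader returns both payloads.
whole = gzip.decompress(combined)

# A single decompressor object only handles one member. Everything after
# the first member's trailer is left untouched in unused_data.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: gzip framing
first_only = single.decompress(combined)
leftover = single.unused_data
```

      Any reader built on a per-member decompressor therefore has to notice the end of a member and start a new decompressor on the leftover bytes.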

      The PR I will create has two commits:

      1. A new unit test that demonstrates the problem.
      2. A fix for the problem.

      The unit test is added to the apache_beam.io.filesystem_test module rather than to textio_test. The existing test apache_beam.io.textio_test.test_read_gzip_concat does not encounter the problem in the Beam 2.11 and earlier code base because its test data is smaller than read_size, so it goes through a code path that avoids the bug. The new test therefore uses a smaller read_size and larger test data in order to trigger the problem. Testing this in the textio_test module would be difficult: the default read_size is 16MiB, so the test data would have to be very large, and the ReadFromText interface does not allow read_size to be modified.
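
      The testing idea above (shrink read_size and grow the data so that member boundaries fall inside and across many reads) can be sketched with the standard library alone. This is not the Beam implementation or the test from the PR: read_concatenated, make_lines, and the chosen sizes are hypothetical stand-ins, and the reader shown is simply one straightforward way to handle concatenated gzip members in fixed-size reads.

```python
import gzip
import io
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect gzip headers/trailers


def read_concatenated(stream, read_size):
    """Decompress concatenated gzip members, reading read_size bytes at a time.

    Hypothetical helper for illustration only; it is not Beam's CompressedFile.
    """
    decompressor = zlib.decompressobj(GZIP_WBITS)
    output = []
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        while chunk:
            output.append(decompressor.decompress(chunk))
            if not decompressor.eof:
                break  # the current member continues in the next read
            # End of a member: leftover bytes belong to the next member,
            # so hand them to a fresh decompressor.
            chunk = decompressor.unused_data
            decompressor = zlib.decompressobj(GZIP_WBITS)
    return b"".join(output)


def make_lines(start, count):
    """Build deterministic sample text (illustrative data only)."""
    return b"".join(b"line %d\n" % i for i in range(start, start + count))


# Three members, each much larger than the small read sizes used below.
parts = [make_lines(n * 2000, 2000) for n in range(3)]
raw = b"".join(gzip.compress(part) for part in parts)
expected = b"".join(parts)

# Small read sizes force member boundaries to land mid-read and across reads;
# the last size reads the whole file at once, like the small-data case.
results = {
    size: read_concatenated(io.BytesIO(raw), size)
    for size in (1, 7, 1024, len(raw) + 1)
}
```

      Running the same input through several read sizes is what makes the check meaningful: a reader that only works when the whole file fits in one read would pass the last case and fail the others.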

          People

            Assignee: Unassigned
            Reporter: Daniel Lescohier (danl)
            Votes: 0
            Watchers: 3

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3h 20m